Understanding Search Trees via Statistical Physics 

We study the random m-ary search tree model (where m stands for the number of branches of a
search tree), an important problem for data storage in computer science, using a variety of statistical
physics techniques that allow us to obtain exact asymptotic results. In particular, we show that the
probability distributions of extreme observables associated with a random search tree, such as the
height and the balanced height of a tree, have a traveling front structure. In addition, the variance
of the number of nodes needed to store a data string of a given size N is shown to undergo a striking
phase transition at a critical value of the branching ratio $m_c = 26$. We identify the mechanism of
this phase transition and show that it is generic and occurs in various other problems as well. New
results are obtained when each element of the data string is a D-dimensional vector. We show that
this problem also has a phase transition at a critical dimension, $D_c = \pi/\sin^{-1}(1/\sqrt{8}) = 8.69363\ldots$.

'Search Trees' are the objects of key interest in an important area of computer science called 'Sorting and Searching'
[1], which deals with the basic question: how does one store the incoming data on a computer efficiently, so
that one spends the minimum time searching for a given data element if it is required later? Amongst various search
algorithms, the tree-based sorting and search algorithms turn out to be the most efficient ones. One of the simplest
such algorithms is the so-called 'binary search algorithm' (BSA), which can be understood through the following simple
example. Consider a data string consisting of N elements labelled by the N integers {1, 2, ..., N}. These
could be the months of the year, the names of people, etc. Let us assume that this data appears in a particular
order, say {6, 4, 5, 8, 9, 1, 2, 10, 3, 7} for N = 10 integers. This data is first stored on a binary tree following a simple
dynamical rule: the first element 6 is stored at the root of the tree (see Fig. 1). The next element in the string is
4. We compare it with 6 at the root and, since 4 < 6, we store 4 in the left daughter node of the root. Had it been
bigger than the root 6, we would have stored it in the right daughter node. The next element in the string is 5. We
again start from the root, see that 5 < 6, and so go to the left branch. There we encounter 4 and find 5 > 4, so we
go to the right daughter node of 4. This process is continued till all the N = 10 elements are assigned their nodes and
we get a unique binary search tree (BST) (see Fig. 1) for this particular data string {6, 4, 5, 8, 9, 1, 2, 10, 3, 7}.




FIG. 1. The binary search tree associated with the data string {6, 4, 5, 8, 9, 1, 2, 10, 3, 7}. (Labels in the
figure: HEIGHT; BALANCED HEIGHT, h = 3.)

Once the data is stored on the tree, it takes very little time to search for a required element. For example, suppose we
are looking for the element 7. We start from the root and, comparing with 6 at the root, we know that 7 must be on
the right branch since 7 > 6. We then go down one level and next compare 7 with 8 (see Fig. 1) and, since 7 < 8,
we look in the left subtree below 8 and immediately find 7. Thus, by construction, we eliminate one half
of the subtrees from the search at every level. This makes the search process very efficient. In fact, the typical search time to find an
element is $t_{\rm search} = D$, where D is the depth of the element in the tree. Since, roughly speaking, $2^D \sim N$, one gets
$t_{\rm search} \sim O(\log N)$, which is far better than a linear search, which takes $t_{\rm search} \sim O(N)$.
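
The insertion and search rules described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming distinct keys; the names `Node`, `insert`, `build` and `search_depth` are hypothetical helpers introduced here and do not come from the references.

```python
class Node:
    """A binary-search-tree node holding a single key."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key: start at the root, go left if smaller, right if larger."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def build(keys):
    """Store the keys one by one, in their order of arrival."""
    root = None
    for k in keys:
        root = insert(root, k)
    return root


def search_depth(root, key):
    """Number of nodes visited until key is found (root counts as 1); None if absent."""
    depth, node = 1, root
    while node is not None:
        if key == node.key:
            return depth
        node = node.left if key < node.key else node.right
        depth += 1
    return None


tree = build([6, 4, 5, 8, 9, 1, 2, 10, 3, 7])
print(search_depth(tree, 7))  # visits 6, then 8, then 7
```

Each comparison discards one subtree, which is why the number of visited nodes is set by the depth of the element rather than by N.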

An immediate generalization of a BST is an m-ary search tree, where the tree has m branches. The BST corresponds
to m = 2. An m-ary search tree is constructed in the following way. Each node of the tree can now hold at most
$(m-1)$ elements. One first collects the first $(m-1)$ elements of the data string and stores them together in the
root of the tree in an ordered sequence $x_1 < x_2 < \dots < x_{m-1}$ [see Fig. 2 for m = 3]. Next, when the m-th element $x_m$
arrives, one compares it first with $x_1$. If $x_m < x_1$, the new element $x_m$ is assigned to the leftmost daughter node of
the root. If $x_1 < x_m < x_2$, $x_m$ goes to the daughter node in the second branch, and so on. Each subsequent incoming
element is assigned to one of the m branches according to this rule. As an example, the same data string
{6, 4, 5, 8, 9, 1, 2, 10, 3, 7} of size N = 10 is stored on an m = 3 tree in Fig. 2. Note that for m > 2, some of the nodes
of the tree are saturated to their capacity, i.e., are fully occupied with $(m-1)$ elements, while some others are only
partially occupied.




FIG. 2. The m = 3 search tree associated with the data string {6, 4, 5, 8, 9, 1, 2, 10, 3, 7}. (Label in the figure:
number of occupied nodes n = 7.)

Once an m-ary tree is constructed, one can define a number of observables associated with the tree which provide
information about the structure of the tree. The knowledge of how these observables depend on the data size N is of
central importance in 'sorting and searching'. Amongst many observables, we focus here on 3 central objects:

1. the height $H_N$ of the tree, which is defined as the distance of the farthest node from the root. For example, in
Fig. 2, we have H = 3. The height $H_N$ measures the maximum possible time to search for an element, i.e., it is a
measure of the worst case scenario.

2. the balanced height $h_N$ of the tree, defined as the maximum depth up to which the tree is balanced, i.e., all the
nodes up to that level are at least partially occupied. In the example of Fig. 2 we have h = 2. Balancing a tree
is important for optimizing search algorithms and hence $h_N$ is an important observable.

3. the number of non-empty nodes $n_N$ of the tree, which tells us how many nodes one typically needs to store
data of size N. For example, in Fig. 2, one has n = 7. Note that for the binary case m = 2, one has trivially
$n_N = N$ since each node can contain only one element. However, for m > 2, $n_N$ becomes a variable since some
of the nodes may only be partially filled.
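
The three observables can be computed directly from the construction rule. The following is a minimal sketch, assuming distinct keys and counting the root as level 1 (the convention that reproduces H = 3, h = 2, n = 7 for Fig. 2); `MNode`, `insert` and `observables` are hypothetical names chosen for this illustration.

```python
import bisect


class MNode:
    """Node of an m-ary search tree: up to m - 1 sorted keys and m child slots."""
    def __init__(self, m):
        self.keys = []
        self.children = [None] * m


def insert(root, key, m):
    """Descend from the root and store key in the first unsaturated node reached."""
    node = root
    while True:
        if len(node.keys) < m - 1:
            bisect.insort(node.keys, key)  # room left: keep the keys ordered
            return
        # Saturated node: branch i holds the keys lying between keys[i-1] and keys[i].
        i = bisect.bisect_left(node.keys, key)
        if node.children[i] is None:
            node.children[i] = MNode(m)
        node = node.children[i]


def observables(keys, m):
    """Return (H, h, n): height, balanced height, number of non-empty nodes."""
    root = MNode(m)
    for k in keys:
        insert(root, k, m)
    level, height, balanced, n_nodes = [root], 0, 0, 0
    still_balanced = True
    while level:
        height += 1
        n_nodes += len(level)
        # A level is complete when all m**(level - 1) possible nodes are occupied.
        if still_balanced and len(level) == m ** (height - 1):
            balanced = height
        else:
            still_balanced = False
        level = [c for node in level for c in node.children if c is not None]
    return height, balanced, n_nodes


print(observables([6, 4, 5, 8, 9, 1, 2, 10, 3, 7], 3))
```

Calling `observables(keys, 2)` treats the binary tree as the special case in which every node holds exactly one key, so that n equals the number of keys.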

Usually the data arrives at a computer in random order. To study this situation, one considers the simplest model,
called the 'random m-ary search tree model' (RmST), where one assumes that the incoming data string can arrive
in any of the N! possible orders or sequences, each with equal probability. For each of these sequences, one has an
m-ary tree and the associated observables $H_N$, $h_N$ and $n_N$. As the sequence changes, the corresponding tree changes
and hence these observables also take on different values. For example, in Fig. 3, we show two sequences, their
corresponding m = 3 trees and the values of the 3 observables. The central question of importance is: given that all
the N! sequences occur with equal probability, what are the statistics of $H_N$, $h_N$ and $n_N$? For example, what are the
averages, variances or even the full probability distributions of these observables?

FIG. 3. The m = 3 search trees associated respectively with the data strings {6, 4, 5, 8, 9, 1, 2, 10, 3, 7}
(H = 3, h = 2, n = 7) and {8, 6, 9, 2, 1, 5, 3, 4, 7, 10} (H = 4, h = 2, n = 6).






The statistics of $H_N$ and $h_N$ have been studied by computer scientists over the past two decades and many nontrivial
results have been found [2-5]. For example, the average height and the average balanced height of a random m-ary
search tree have the following asymptotic behaviors for large N,

$$\langle H_N \rangle \approx a_m \log N + b_m \log(\log N) + \dots$$

$$\langle h_N \rangle \approx c_m \log N + d_m \log(\log N) + \dots \tag{1}$$

While the leading $\log N$ behavior was proved by Devroye [4], who also computed the coefficients $a_m$ and $c_m$, the
subleading double logarithmic behavior was conjectured only recently by Hattori and Ochiai [6], who found $b_2 \approx -1.9$
numerically. Also, the variance and even the higher moments of $H_N$ and $h_N$ were found to be independent of N for
large N [7,8].

The study of the statistics of $n_N$, on the other hand, is relatively recent [9-11]. Chern and Hwang recently found
[10] that while the average $\mu(N) = \langle n_N \rangle \sim N$ for large N (as one should expect), the variance $\nu(N) = \langle (n_N - \mu(N))^2 \rangle$
undergoes a striking phase transition as a function of m. They found that

$$\nu(N) \sim \begin{cases} N & \text{for } m \le 26 \\ N^{2\theta(m)} & \text{for } m > 26 \end{cases} \tag{2}$$

where the exponent $\theta(m) > 1/2$ depends on m for m > 26.

The various important results mentioned above were derived by computer scientists using sophisticated
probabilistic methods which, though rigorous, are often not simple. As physicists, we would like to understand and derive
these results in a physically transparent way. Moreover, as often happens, a physical approach has the advantage
that it can make links with other problems, and generalization often becomes easier. In a series of recent
papers [12-15], we were able to build up a statistical physics approach to the RmST problem which not only allowed
us to rederive many asymptotically exact results (known previously only via rigorous probabilistic methods) in a
physically transparent way, but also led to many new results, generalizations and links to other problems. For example, we
were able to generalize our results to other search trees such as the 'digital search trees' (DST) (which have links to
the Lempel-Ziv data compression algorithm) and found an exact mapping between the DST and the
directed diffusion limited aggregation (DLA) problem on the Bethe lattice [16]. The latter problem was first studied
numerically by Bradley and Strenski [17] and remained unsolved for many years. Our approach provides an exact
asymptotic result for this DLA problem [16].

Our strategy was to first map the RmST problem to a random fragmentation problem, which was more amenable
to statistical physics analysis. The main new discovery was that the distributions of the height $H_N$ and the balanced
height $h_N$, which are 'extreme' variables, have a 'traveling front' structure. 'Traveling fronts' appear in many
physics and biology problems and have been well studied over the past few decades [18]. The techniques developed
for analysing traveling fronts were then useful for deriving many asymptotically exact results for the RmST problem.
Subsequently, we found that in many problems where one is interested in finding the statistics of extreme variables,
there is often a 'traveling front' structure [19,20].

For the number of non-empty nodes $n_N$, which is not an extreme variable, a different statistical physics approach
(equivalent to a backward Fokker-Planck method) was used, which allowed us to understand the mechanism of the
phase transition and the significance of the critical number 26, and to calculate the exponent $\theta(m)$ exactly [15]. We were also
able to show that this phase transition is rather generic and occurs in other problems as well. Our approach allowed
us to generalize to the case when the data string consists of N D-dimensional vectors. For example, we found that
there is again a phase transition at a critical dimension $D_c = \pi/\sin^{-1}(1/\sqrt{8}) = 8.69363\ldots$. In the next few sections
we outline our approach and state the main results.

II. MAPPING TO A FRAGMENTATION PROCESS 

Our strategy is to first map the problem of RmST to a random fragmentation problem [13,15], which in some sense, 
is more familiar to physicists. This fragmentation procedure can then be viewed as a dynamical process and one can 
write down its evolution equation fairly easily. This mapping is best understood in terms of an example. Let us take 
our favorite data string {6, 4, 5, 8, 9, 1, 2, 10, 3, 7} and store it on an m = 3 tree as in Fig. 2 and also shown in the left 
half of Fig. 4. 

FIG. 4. The m = 3 search tree associated with the data string {6, 4, 5, 8, 9, 1, 2, 10, 3, 7} and the corresponding fragmentation
process. (Panel titles in the figure: TREE CONSTRUCTION; FRAGMENTATION PROCESS.)

In the fragmentation problem, one starts with a stick (or interval) of length N = 10. Once the first two elements 4
and 6 are stored in the root of the tree, the remaining elements will belong to one of the intervals [1-3], [5], or [7-10],
which are subsequently completely disconnected from each other. Thus storing the first two elements is equivalent,
in the fragmentation problem, to breaking the original interval [1-N] of length N into 3 smaller intervals [1-3],
[5], and [7-10]. The two break points 4 and 6 are chosen uniformly from the N points {1, 2, 3, ..., N} in the RmST
problem (shown by the arrows in Fig. 4). Next, when the element 5 arrives in the tree, it corresponds to breaking
the interval containing the element 5 randomly into 3 parts (this breaking is not shown explicitly in Fig. 4). The
process is then repeated for the other elements. Note that in the fragmentation problem, an interval breaks if and only if there is
an element (shown by black dots) inside the interval. Thus there is a threshold phenomenon: if the length of a stick
is so small that it doesn't have an element (black dot) in it, one doesn't fragment it any more. We denote this
threshold length by $N_0$ (in our example, $N_0 = 1$). It just sets the unit of length and its actual value is not important
for the asymptotic large N analysis. Those intervals which still have black dots in them (and thus have lengths $> N_0$)
are 'alive' and will fragment subsequently, but those whose lengths are $< N_0$ are 'dead'. Thus, when all the N
elements are stored in the tree, all the intervals in the corresponding fragmentation problem become 'dead'.

Note that, in the fragmentation problem, at each step (shown by different levels on the right in Fig. 4) there is only
one 'splitting event'. Each time an interval splits, it corresponds to occupying a node on the tree. Thus completing a
tree is equivalent to ending one 'history' of the fragmentation process (at the end of which all intervals are 'dead').
Evidently, the number of non-empty nodes $n_N$ in the tree is exactly the same as the total number of 'splitting events' in
the history of the fragmentation process (for example, in Fig. 4 the number of nodes on the tree and the number of
splitting events are both 7). Let the $l_i$'s denote the lengths of the intervals in the fragmentation problem at a given stage.
One can then set up a dictionary between the two problems [13,15] and it is easy to see that

1. Height $H_N$: Prob[$H_N < n$] = Prob[$l_1 < 1, l_2 < 1, \dots$ after n steps of fragmentation].

2. Balanced height $h_N$: Prob[$h_N > n$] = Prob[$l_1 > 1, l_2 > 1, \dots$ after n steps of fragmentation].

3. Number of non-empty nodes $n_N$: Prob[$n_N = n$] = Prob[there are a total of n 'splitting events' till the end of
the fragmentation process].

III. ANALYSIS OF THE FRAGMENTATION PROBLEM 

Once this dictionary is set up, one can forget about the original tree problem and focus on the fragmentation
problem. For simplicity, we will also assume that the lengths of sticks in the fragmentation problem are continuous
variables. This is because the original discrete problem and the continuous problem have the same asymptotic
properties for large N, but the continuous problem is easier to handle. Thus, in the continuous problem, we start with
a stick of length N, where N is large. We break it randomly into m fragments of lengths $r_1 N, r_2 N, \dots, r_m N$, where
the fractions $r_i$ are random numbers in [0, 1] that satisfy the length conservation condition $\sum_{i=1}^{m} r_i = 1$. At
this point, we will consider a general problem where the fractions $r_i$ are drawn from a normalized joint distribution
$\eta(r_1, r_2, \dots, r_m)$. The RmST problem corresponds to a specific choice of this joint distribution. Note that in the
RmST problem, all the N! permutations of the original sequence are equally likely. This means that the first $(m-1)$
elements are random, each drawn independently and uniformly from [1, N]. In the fragmentation language, this means
that each of the fractions $r_1, r_2, \dots, r_{m-1}$ is chosen from a uniform distribution between 0 and 1 and then one sets
$r_m = 1 - (r_1 + r_2 + \dots + r_{m-1})$. This leads to the normalized joint distribution $\eta(r_1, r_2, \dots, r_m) = (m-1)!\,\delta\left(\sum_{i=1}^{m} r_i - 1\right)$
[13]. One of the advantages of our method is that it allows us to obtain exact results for an arbitrary joint distribution
of the fractions $r_i$, not only for the uniform case. The RmST problem just corresponds to a special case.



FIG. 5. The fragmentation process with continuous lengths for m = 5. The arrows denote the break points.

After the first splitting event, we examine the lengths of each of the m fragments. If the length of a fragment
is already less than $N_0 = 1$, we proclaim it 'dead' and it doesn't split any further. Those fragments with lengths
$> N_0 = 1$ are 'alive', and each of those 'alive' fragments is further split into m pieces by drawing, for each piece
independently, a set of fractions $r_i$ from the identical joint distribution $\eta(\{r_i\}) = \eta(r_1, r_2, \dots, r_m)$. This process is
then repeated till all the intervals become 'dead', i.e., their lengths become $< N_0 = 1$. A pictorial representation is
given in Fig. 5 with m = 5.
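
The continuous fragmentation process is straightforward to simulate. The sketch below is an illustration under stated assumptions: the break points are taken as m - 1 independent uniform numbers (the RmST choice), the threshold is $N_0 = 1$, and `count_splits` is a hypothetical helper name. As a sanity check, for m = 2 the mean number of splitting events obeys $\mu(N) = (2/N)\int_1^N \mu(x)\,dx + 1$, which is solved by $\mu(N) = 2N - 1$ for N > 1; this check was worked out for the example and is not quoted from the references.

```python
import random


def count_splits(N, m, rng, N0=1.0):
    """Run one history of the fragmentation process; return the number of splitting events."""
    alive = [N] if N > N0 else []
    splits = 0
    while alive:
        length = alive.pop()
        splits += 1
        # m - 1 uniform break points give m fractions r_i that sum to 1.
        cuts = sorted(rng.random() for _ in range(m - 1))
        edges = [0.0] + cuts + [1.0]
        for a, b in zip(edges, edges[1:]):
            piece = (b - a) * length
            if piece > N0:          # only 'alive' pieces are split again
                alive.append(piece)
    return splits


rng = random.Random(12345)          # fixed seed so the run is reproducible
N, m, runs = 50.0, 2, 4000
mean = sum(count_splits(N, m, rng) for _ in range(runs)) / runs
print(mean, 2 * N - 1)              # sample mean versus 2N - 1 for m = 2
```

Because the pieces evolve independently, the order in which 'alive' pieces are processed does not affect the total count, so a simple stack suffices.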

For subsequent analysis, it is useful to define the 'marginal' distribution $\eta(r_i)$ of any one of the fractions as

$$\eta(r_i) = \int \eta(\{r_j\})\, dr_1 \dots dr_{i-1}\, dr_{i+1} \dots dr_m. \tag{3}$$

For simplicity, we will assume isotropy, i.e., $\eta(r_i) = \eta(r)$ is independent of the index i and is thus the same for each
fragment. For example, for the RmST problem, one easily gets [13]

$$\eta(r) = (m-1)(1-r)^{m-2} \tag{4}$$

for $0 \le r \le 1$. Note that for binary trees, m = 2, where one breaks a stick into two pieces, one gets $\eta(r) = 1$ for
$0 \le r \le 1$, the usual uniform distribution for the break point.



A. The Height and the Balanced Height 

Let us denote the cumulative height distribution Prob[$H_N < n$] by $P(n,N)$. Using the dictionary outlined before,
we have $P(n,N)$ = Prob[$l_1 < 1, l_2 < 1, \dots$ after n steps of fragmentation starting with the initial length N], where
the $l_i$'s are the lengths of the intervals. It is then easy to set up a recursion satisfied by $P(n,N)$ for the fragmentation
process. Consider the first splitting, where we have m new intervals of lengths $r_1 N, r_2 N, \dots, r_m N$. Each of these new
pieces will have a subsequent history of evolution completely independent of the others. Hence, it follows that

$$P(n,N) = \int \prod_{i=1}^{m} P(n-1, r_i N)\, \eta(\{r_i\})\, dr_1\, dr_2 \dots dr_m, \tag{5}$$

satisfying the condition $P(n,1) = 1$ for all n > 1 (this follows from the fact that if the initial length is 1, after the
first splitting all the lengths will be < 1). Equation (5) is reminiscent of a backward Fokker-Planck equation. It
is further useful to make a change of variables, $t = \log N$ and $\epsilon_i = -\log r_i$. The joint distribution of the $\epsilon_i$'s is given
by $\tilde\eta(\{\epsilon_i\}) \prod_i d\epsilon_i = \eta(\{r_i\}) \prod_i dr_i$. Then Eq. (5) reduces to

$$P(n,t) = \int \prod_{i=1}^{m} P(n-1, t-\epsilon_i)\, \tilde\eta(\{\epsilon_i\})\, d\epsilon_1\, d\epsilon_2 \dots d\epsilon_m. \tag{6}$$

Equation (5) (or equivalently Eq. (6)) is nonlinear and hence is difficult to solve exactly. However, if one plots the
numerical solution of Eq. (5), one finds a traveling front structure as shown in Fig. 6.

FIG. 6. The traveling front structure of the solution of Eq. (5). (Labels in the figure: horizontal axis n;
INCREASING log(N); TRAVELLING FRONT.)

This means that the solution at late 'times' t has the structure $P(n,t) \sim f(n - n_f(t))$, where $n_f(t)$ is the position
of the front at 'time' t. Note that the front retains its shape as t increases, which indicates that the width of the front
remains of O(1) even as $t \to \infty$. The front position advances with a uniform velocity, i.e., $n_f(t) \approx vt$ to leading order
for large t, where the velocity v is yet to be determined. We substitute $P(n,t) = 1 - F(n - vt)$ in Eq. (6) and then
focus near the large n tail, where F is small and hence one can linearize the equation to get

$$F(x) = m \int_0^{\infty} F(x - 1 + v\epsilon)\, \tilde\eta(\epsilon)\, d\epsilon, \tag{7}$$

where $\tilde\eta(\epsilon)\, d\epsilon = \eta(r)\, dr$ is the effective induced distribution associated with any one of the fractions. This linear
equation clearly admits an exponential solution $F(x) = e^{-\lambda x}$, provided $\lambda$ is related to v via the dispersion relation

$$1 = m\, e^{\lambda} \int_0^{\infty} e^{-\lambda v \epsilon}\, \tilde\eta(\epsilon)\, d\epsilon. \tag{8}$$

Thus, in principle, one can have a whole family of possible velocities $v(\lambda)$ parametrized by $\lambda$. However, in practice,
the front has a unique velocity. So, how does one select this unique velocity from a continuous one-parameter family
of possible velocities? It turns out that the solution $v(\lambda)$ of Eq. (8) is a nonmonotonic function of $\lambda$ with a single
minimum at $\lambda = \lambda^*$ that depends on the distribution $\tilde\eta(\epsilon)$. According to the velocity selection principle developed
in the traveling front literature [18,20], the front always chooses this minimum velocity $v(\lambda^*)$ as long as the initial
condition is sharp enough. Thus the leading front position is given by $n_f(t) \approx v(\lambda^*)\, t$, where $v(\lambda^*)$ is obtained by
minimizing $v(\lambda)$ in Eq. (8) with respect to $\lambda$. Moreover, it turns out that the leading front position has an associated
slow logarithmic correction [18],

$$n_f(t) = v(\lambda^*)\, t - \frac{3}{2\lambda^*} \log(t) + \dots \tag{9}$$

Note that since Prob[$H_N < n$] = $P(n,N)$, the expected height is $\langle H_N \rangle = \sum_n [1 - P(n,N)] \approx n_f(t)$, where $t = \log N$.
This follows from the fact that the front rises sharply from 0 for $n < n_f(t)$ to 1 for $n > n_f(t)$. Thus, the summation
$\sum_n [1 - P(n,N)]$ can be replaced by the front location $n_f(t)$. Using Eq. (9), we then get

$$\langle H_N \rangle = v(\lambda^*) \log N - \frac{3}{2\lambda^*} \log(\log N) + \dots \tag{10}$$

This then provides a physical derivation of the result in Eq. (1), where we identify the constant $a_m = v(\lambda^*)$ with
the velocity of the front and the constant $b_m = -3/(2\lambda^*)$ as the prefactor of the correction term. Note that our result
is more general than the RmST (which is just a special case where the break points are chosen uniformly). Our
derivation also provides a proof of the double logarithmic form of the correction term, previously only conjectured
by Hattori and Ochiai [6].

For the RmST problem, we have $\eta(r)$ from Eq. (4). This gives $\tilde\eta(\epsilon) = (m-1)(1 - e^{-\epsilon})^{m-2} e^{-\epsilon}$. Substituting this
in Eq. (8), we get the dispersion relation

$$m(m-1)\, e^{\lambda}\, B(\lambda v + 1, m - 1) = 1, \tag{11}$$

where $B(x, y)$ is the standard Beta function. For example, for the binary case m = 2, one gets from Eq. (11)
$v(\lambda) = (2e^{\lambda} - 1)/\lambda$, which has a minimum at $\lambda^* = 0.76804\ldots$ with $v(\lambda^*) = 4.31107\ldots$. One then gets for m = 2 an
exact result,

$$\langle H_N \rangle = 4.31107\ldots \log N - 1.95303\ldots \log(\log N) + \dots \tag{12}$$

Similarly, one can derive the exact asymptotic behavior for all m and for an arbitrary fraction distribution $\eta(r)$ [13].
Note that for the binary case m = 2, the same double logarithmic correction term was also found by Reed using
rigorous probabilistic methods [21], but our results seem to be more general.
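
The velocity selection for the binary case is easy to reproduce numerically. The sketch below is a minimal illustration, not the method of the references: it locates the minimum of $v(\lambda) = (2e^{\lambda} - 1)/\lambda$ by bisection on the numerator of $dv/d\lambda$, namely $2e^{\lambda}(\lambda - 1) + 1$, which increases monotonically for $\lambda > 0$. The function names and the bracketing interval [0.1, 3] are assumptions made for this example.

```python
import math


def v_height(lam):
    """Velocity branch for m = 2: v(lambda) = (2 e^lambda - 1) / lambda."""
    return (2.0 * math.exp(lam) - 1.0) / lam


def dv_numerator(lam):
    """lambda^2 * dv/dlambda = 2 e^lambda (lambda - 1) + 1; its zero is the minimum of v."""
    return 2.0 * math.exp(lam) * (lam - 1.0) + 1.0


def bisect_root(f, lo, hi, iterations=200):
    """Plain bisection for a root of f, assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == (f_lo > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam_star = bisect_root(dv_numerator, 0.1, 3.0)
a_2 = v_height(lam_star)            # leading coefficient of log N
b_2 = -3.0 / (2.0 * lam_star)       # coefficient of log(log N)
print(lam_star, a_2, b_2)
```

The three printed numbers can be compared with $\lambda^* = 0.76804\ldots$ and with the two coefficients in Eq. (12).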

For the balanced height $h_N$, the analysis is similar. The cumulative probability $Q(n,N)$ = Prob[$h_N > n$] satisfies
exactly the same recursion relation as in Eq. (5), except that the initial condition is different [13]. One has $Q(n,1) = 1$
for n < 1 and $Q(n,1) = 0$ for n > 1. Again, the solution has a traveling front structure, except that now it has a [1-0]
front as opposed to the [0-1] front in the height case. Proceeding along the same path, one obtains the asymptotic
front position and hence the average balanced height,

$$\langle h_N \rangle = v(\lambda^*) \log N + \frac{3}{2\lambda^*} \log(\log N) + \dots \tag{13}$$

where $v(\lambda^*)$ is determined by maximizing $v(\lambda)$ obtained from the dispersion relation

$$1 = m\, e^{-\lambda} \int_0^{\infty} e^{+\lambda v \epsilon}\, \tilde\eta(\epsilon)\, d\epsilon. \tag{14}$$

Note that this dispersion relation is the same as in Eq. (8) provided one changes the sign of $\lambda$. This reflects the
so-called 'duality' between the height and the balanced height [13]. For the m = 2 binary case, we get from Eq. (14)
$v(\lambda) = (1 - 2e^{-\lambda})/\lambda$, which has a maximum at $\lambda^* = 1.67835\ldots$ with $v(\lambda^*) = 0.373365\ldots$. This gives [13]

$$\langle h_N \rangle = 0.373365\ldots \log N + 0.89374\ldots \log(\log N) + \dots \tag{15}$$

Note that the sign of the correction term is different in Eqs. (12) and (15). Similarly, one can derive exact asymptotic
results for all m as well as for any arbitrary distribution $\eta(r)$.
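
The balanced-height coefficients for m = 2 can be checked in the same spirit. This sketch assumes only the formula $v(\lambda) = (1 - 2e^{-\lambda})/\lambda$ quoted above; the maximum is found by bisection on the numerator of $dv/d\lambda$, which is $2e^{-\lambda}(\lambda + 1) - 1$ and decreases monotonically for $\lambda > 0$. The function names and the bracket [0.5, 5] are hypothetical choices for this illustration.

```python
import math


def v_balanced(lam):
    """Velocity branch for m = 2: v(lambda) = (1 - 2 e^(-lambda)) / lambda."""
    return (1.0 - 2.0 * math.exp(-lam)) / lam


def dv_numerator(lam):
    """lambda^2 * dv/dlambda = 2 e^(-lambda) (lambda + 1) - 1; its zero is the maximum of v."""
    return 2.0 * math.exp(-lam) * (lam + 1.0) - 1.0


def bisect_root(f, lo, hi, iterations=200):
    """Plain bisection for a root of f, assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == (f_lo > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam_star = bisect_root(dv_numerator, 0.5, 5.0)
c_2 = v_balanced(lam_star)          # leading coefficient of log N
d_2 = 3.0 / (2.0 * lam_star)        # coefficient of log(log N), positive here
print(lam_star, c_2, d_2)
```

The printed values can be compared with $\lambda^* = 1.67835\ldots$ and with the two coefficients in Eq. (15).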



B. Number of Non-empty Nodes 

We now turn to the statistics of the number of non-empty nodes $n_N$ required to store a data string of size N. Once
again, the fragmentation representation turns out to be useful. One can easily write down a recursion relation for $n_N$
by noting that $n_N$ is just the total number of splitting events in the fragmentation process till it stops, given that
it started with an initial stick of length N. After the first splitting one has m pieces of lengths $r_1 N, r_2 N, \dots, r_m N$
whose subsequent histories are completely independent of each other. Note that an interval splits if and only if its length is
$> N_0$, where $N_0$ is the threshold length. Evidently, if the starting length $N < N_0$, then $n_N = 0$ since there would not be
any splitting. However, if $N > N_0$, one can write a recursion [15],

$$n_N \equiv n_{r_1 N} + n_{r_2 N} + \dots + n_{r_m N} + 1, \tag{16}$$

where the fractions $r_i$ are again random numbers satisfying $\sum_{i=1}^{m} r_i = 1$ that are drawn from a joint distribution
$\eta(\{r_i\})$. The term 1 on the right hand side of Eq. (16) just counts the first splitting, and the rest of the terms count
the total number of subsequent splitting events arising from each of the m pieces generated after the first splitting.
The $\equiv$ symbol represents 'equivalence in law', i.e., the left and the right hand sides of the $\equiv$ symbol have the same
probability distribution.

Taking the average on both sides of Eq. (16), one finds that the average number of nodes or 'splitting events',
$\mu(N) = \langle n_N \rangle$, satisfies an integral equation [15]

$$\mu(N) = m \int_{N_0/N}^{1} \mu(rN)\, \eta(r)\, dr + 1. \tag{17}$$

This integral equation can be solved exactly [15]. One finds that $\mu(N) = g(N/N_0)$, where the scaling function $g(z)$ is
given by

$$g(z) = a_0 + a_1 z + \sum_{k=2}^{\infty} a_k z^{\lambda_k}, \tag{18}$$

where the $\lambda_k$'s are the roots of the following equation with $\mathrm{Re}(\lambda_k) < 1$,

$$m \int_0^1 r^{\lambda}\, \eta(r)\, dr = 1. \tag{19}$$

Note that $\lambda = 1$ is always a root of Eq. (19). This follows from the following observation. By averaging the sum rule
$\sum_{i=1}^{m} r_i = 1$, one gets $\langle r \rangle = 1/m$, which shows that $\lambda = 1$ is always a solution of Eq. (19). In fact, the linear term in
Eq. (18) corresponds to this root at $\lambda = 1$. Furthermore, one can prove that all the other roots $\lambda_k$ are complex, that if
$\lambda_k$ is a root its complex conjugate $\lambda_k^*$ is also a root, and that all these other roots lie in the complex $\lambda$ plane to the left of
the line $\lambda = 1 + iy$ (y real), i.e., $\mathrm{Re}(\lambda_k) < 1$. The leading behavior of the average for large N is given by the
linear term and one gets $\mu(N) \approx a_1 N/N_0$, where

$$a_1 = -\frac{1}{m \int_0^1 r \log(r)\, \eta(r)\, dr}. \tag{20}$$

For the RmST problem, we have $\eta(r) = (m-1)(1-r)^{m-2}$, which gives $a_1 = 1/\sum_{k=2}^{m} (1/k)$.
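
As a quick numerical check of Eq. (20) for the RmST marginal of Eq. (4), one can evaluate the integral by a simple quadrature and compare with the harmonic-sum form $1/\sum_{k=2}^{m} 1/k$. This is only a sketch: the midpoint rule, the step count and the function names are choices made for this illustration.

```python
import math


def eta(r, m):
    """Marginal distribution of one fraction for the RmST problem, Eq. (4)."""
    return (m - 1) * (1.0 - r) ** (m - 2)


def a1_numeric(m, steps=200000):
    """a_1 = -1 / (m * integral_0^1 r log(r) eta(r) dr), using the midpoint rule."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h           # midpoints avoid evaluating log(0)
        total += r * math.log(r) * eta(r, m)
    return -1.0 / (m * total * h)


def a1_closed(m):
    """Harmonic-sum form: 1 / (1/2 + 1/3 + ... + 1/m)."""
    return 1.0 / sum(1.0 / k for k in range(2, m + 1))


for m in (2, 3, 5, 26):
    print(m, a1_numeric(m), a1_closed(m))
```

For m = 2 both expressions give $a_1 = 2$, i.e. $\mu(N) \approx 2N/N_0$ in the continuous problem.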



For the variance $\nu(N) = \langle (n_N - \mu(N))^2 \rangle$, one can similarly write down a recursion relation [15] starting from Eq.
(16),

$$\nu(N) = m \int_{N_0/N}^{1} \nu(rN)\, \eta(r)\, dr + J, \tag{21}$$

where J represents a 'source' term that depends on the form of the first moment $\mu(N)$. More precisely, if
$S = \sum_{i=1}^{m} \mu(r_i N)$, then $J = \langle (S - \langle S \rangle)^2 \rangle$. The significant fact about this problem is that the equation for the second
moment 'closes' in the sense that it involves only second and first moments, but not higher moments. It does not
have the usual hierarchy problem that one often encounters in statistical mechanics problems. This fact makes the
problem analytically tractable. The source term J also turns out to be responsible for driving the 'phase transition'
in the variance. This is a new mechanism of phase transition that has not been encountered before in other problems.

Using the exact solution for the first moment $\mu(N)$ from Eq. (18), one can evaluate the source term J, which turns
out to be a function only of $z = N/N_0$, and for large z one gets

$$J(z) \approx \beta_1 z^{2\lambda_2} + \beta_2 z^{2\lambda_2^*} + \beta_3 z^{\lambda_2 + \lambda_2^*} + \dots \tag{22}$$

where $\lambda_2$ (and its complex conjugate $\lambda_2^*$) are the nearest zeros of the equation $m \int_0^1 r^{\lambda}\, \eta(r)\, dr = 1$ to the left of the
line $\mathrm{Re}(\lambda) = 1$ in the complex $\lambda$ plane. Substituting this asymptotic behavior of $J(z)$ in Eq. (21) and solving the
integral equation, one finds that $\nu(N) = Y(N/N_0)$, where the asymptotic behavior of $Y(z)$ for large z depends on the
value of $\mathrm{Re}(2\lambda_2)$. One finds that as $z \to \infty$, $Y(z) \sim z$ (as in the case of the first moment) provided $\mathrm{Re}(2\lambda_2) < 1$. In
this case, the source term J turns out to be insignificant and gives rise only to subleading correction terms. However,
if $\mathrm{Re}(2\lambda_2) > 1$, the source term $J(z)$ becomes significant and controls the asymptotic behavior of $Y(z)$, and one gets
$Y(z) \sim z^{2\theta}$ where $\theta = \mathrm{Re}(\lambda_2)$.

Note that the root $\lambda_2$ is a function of m. As one tunes m, $\lambda_2$ changes but always stays to the left of the line
$\mathrm{Re}(\lambda) = 1$ in the complex $\lambda$ plane. For small m, $\mathrm{Re}(2\lambda_2) < 1$, i.e., $\lambda_2$ stays to the left of the line $\mathrm{Re}(\lambda) = 1/2$.
Then, as m exceeds a critical value $m_c$, $\lambda_2$ crosses the line $\mathrm{Re}(\lambda) = 1/2$ from its left to its right and $\mathrm{Re}(2\lambda_2) > 1$,
leading to a phase transition in the large N behaviour of $\nu(N)$. Thus the critical value $m_c$ is determined from
the condition $\mathrm{Re}(\lambda_2) = 1/2$. For the RmST problem, substituting $\eta(r) = (m-1)(1-r)^{m-2}$ in Eq. (19) one gets
$m(m-1)\, B(m-1, \lambda+1) = 1$. One then obtains $\lambda_2$ using Mathematica. Setting $\mathrm{Re}(\lambda_2) = 1/2$ determines the
critical value, $m_c = 26.0561\ldots$. Note that, once we have written down the moment equations, m can be treated as a
continuous parameter, even though in actual search trees m is always an integer. We thus get a very general result,

$$\nu(N) \sim \begin{cases} N & \text{for } m < m_c \\ N^{2\theta(m)} & \text{for } m > m_c \end{cases} \tag{23}$$

for an arbitrary breaking distribution $\eta(r)$, where $m_c$ is determined from $\mathrm{Re}(\lambda_2) = 1/2$ and $\theta(m) = \mathrm{Re}(\lambda_2)$, where $\lambda_2$ is
determined from Eq. (19). For the RmST case in particular, we get $m_c = 26.0561\ldots$.
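
For integer m, the RmST form of Eq. (19), $m(m-1)B(m-1, \lambda+1) = 1$, is equivalent to the polynomial condition $\prod_{k=1}^{m-1} (\lambda + k) = m!$, so the position of $\lambda_2$ on either side of $m_c$ can be examined without special functions. The sketch below is an illustration only and is not the Mathematica computation mentioned above. It applies Newton's method to the logarithm of the product, looking for the solution whose phases add up to $2\pi$; that this solution is $\lambda_2$, and that the first-order starting guess is adequate, are assumptions of the example, which is meant for m near 26. The name `lambda2` is hypothetical.

```python
import cmath
import math


def lambda2(m, iterations=50):
    """Upper-half-plane root of prod_{k=1}^{m-1} (lam + k)/(k + 1) = 1 with total phase 2*pi."""
    s1 = sum(1.0 / k for k in range(2, m + 1))
    lam = complex(1.0, 2.0 * math.pi / s1)      # first-order guess around lambda = 1
    for _ in range(iterations):
        # f(lam) = sum of logs - 2*pi*i vanishes when the product equals 1.
        f = sum(cmath.log((lam + k) / (k + 1)) for k in range(1, m)) - 2j * math.pi
        df = sum(1.0 / (lam + k) for k in range(1, m))
        lam -= f / df                           # Newton step
    return lam


def residual(lam, m):
    """|prod (lam + k)/(k + 1) - 1|, which should vanish at a root."""
    p = complex(1.0, 0.0)
    for k in range(1, m):
        p *= (lam + k) / (k + 1)
    return abs(p - 1.0)


for m in (26, 27):
    root = lambda2(m)
    print(m, root, residual(root, m))
```

The real part of the returned root should lie just below 1/2 for m = 26 and above 1/2 for m = 27, in line with $m_c = 26.0561\ldots$.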

Thus, we have identified a simple mechanism (driven by the source term) of a rather striking and nontrivial phase
transition in a generic fragmentation problem [15]. There is a physical meaning associated with this phase transition.
For $m < m_c$, the fluctuation (variance) in the number of splitting events scales as N for large N and the central limit
theorem holds. In fact, one finds that the full distribution of $n_N$ is Gaussian for $m < m_c$. However, for $m > m_c$,
rare events occasionally give rise to huge fluctuations. In the language of the fragmentation problem, note that the
effective distribution of the fraction, $\eta(r) = (m-1)(1-r)^{m-2}$, gets highly localized around r = 0 for large m. This
means that for large m, most of the m fragments have very tiny lengths (and thus become 'dead') except one which
has a huge length (due to the length conservation condition $\sum_{i=1}^{m} r_i = 1$). This large piece will therefore persist for a long
time and one will get a huge number of splitting events. This qualitative argument, of course, does not explain why
there is a sharp phase transition. For that, one has to carry out explicit calculations as done here.

IV. GENERALIZATION TO VECTOR DATA STRING



So far, we have considered the storing of a data string of size N on a tree where each element of the data is a
scalar. A natural generalization of this is when the data consists of a string of N D-dimensional vectors. For example,
suppose we have the following data of 2-dimensional vectors: {(6, 4), (4, 3), (5, 2), (8, 7), ...}. How do we store this
data on a tree? The corresponding tree is known as a quad-tree in the computer science literature [22]. To store
this data, one imagines an N x N square. The first key (6, 4) is stored at the co-ordinate (6, 4) of this square and
it forms the root of the tree. This root now has 4 branches corresponding to the 4 quadrants around the point (6, 4).
Note immediately the analogy to a corresponding fragmentation problem: the storing of the first vector corresponds
to fragmenting the original N x N square into 4 rectangles which meet at (6, 4) (see Fig. 7). This is the
generalization of breaking a one-dimensional stick in the scalar case. Since both the components 6 and 4 are chosen
independently and randomly from the set {1, 2, 3, ..., N}, this becomes a random fragmentation problem where the
side lengths of any one of the 4 rectangles are chosen uniformly from the interval [0, N].



FIG. 7. The storing of (6, 4) and (4, 3) on a quad tree and the corresponding square fragmentation process. [Figure
labels: splitting due to (6, 4); splitting due to (4, 3); QUAD-TREE.]



The next element (4, 3) arrives and storing (4, 3) is equivalent to the fragmentation of the rectangle containing the
new point (4, 3) into 4 smaller rectangles. This process continues till all the data is stored, i.e., when the areas
of all the rectangles become smaller than some threshold value A = 1. One immediately sees the generalization to
the case where each data element is a D-dimensional tuple. In the corresponding fragmentation problem, one starts
with a D-dimensional cuboid of side lengths N and the arrival of each data element corresponds to fragmenting a cuboid
into $2^D$ smaller cuboids. Note that D = 1 corresponds to the binary search tree of the scalar data, discussed
before.
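The storage rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not the construction used in [22]: the names `QuadNode`, `orthant`, `insert` and `height` are hypothetical, all coordinates are assumed distinct, and a child is labelled by a bitmask whose i-th bit is set when the new key is not smaller than the stored key along coordinate i, so that each node has at most $2^D$ children.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence, Tuple


@dataclass
class QuadNode:
    """One stored key together with its (at most 2**D) children, indexed by orthant."""
    key: Tuple[int, ...]
    children: Dict[int, "QuadNode"] = field(default_factory=dict)


def orthant(key: Sequence[int], pivot: Sequence[int]) -> int:
    """Bitmask of the orthant of `key` around `pivot`: bit i is set when key[i] >= pivot[i]."""
    return sum(1 << i for i, (k, p) in enumerate(zip(key, pivot)) if k >= p)


def insert(root: Optional[QuadNode], key: Sequence[int]) -> QuadNode:
    """Store `key` by descending from the root until an empty orthant is found."""
    key = tuple(key)
    if root is None:
        return QuadNode(key)  # the first key becomes the root
    node = root
    while True:
        idx = orthant(key, node.key)
        child = node.children.get(idx)
        if child is None:
            # Empty orthant: the key is stored here, i.e. this cuboid is fragmented.
            node.children[idx] = QuadNode(key)
            return root
        node = child


def height(node: Optional[QuadNode]) -> int:
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    if node is None:
        return 0
    return 1 + max((height(c) for c in node.children.values()), default=0)
```

For the data {(6, 4), (4, 3), (5, 2), (8, 7)} this places (4, 3) and (8, 7) directly below the root (6, 4), while (5, 2) falls in the rectangle already split by (4, 3) and is therefore stored one level deeper.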

Following similar routes as in the m-ary search tree case, we were able to determine the exact asymptotic properties
of the height $H_N$, the balanced height $h_N$ and the number of non-empty nodes $n_N$ of a D-dimensional quad-tree.
We just mention our main results here without providing details since they are similar to the earlier cases. For the
extreme variables such as the height $H_N$ and the balanced height $h_N$, we again find a traveling front structure whose
analysis provides us with the following exact asymptotics for large N,



$$\langle H_N \rangle \approx 4.31107\ldots \log N - \frac{1.95303\ldots}{D} \log(D \log N) + \ldots$$

$$\langle h_N \rangle \approx 0.373365\ldots \log N + \frac{0.89374\ldots}{D} \log(D \log N) + \ldots \qquad (24)$$



Surprisingly, the leading behavior (especially the coefficients of the log N terms) turns out to be independent of the
dimension D. Besides, due to the existence of a traveling front structure, one immediately finds that all the
higher moments, including the variances of $H_N$ and $h_N$, are bounded, i.e., $\sim O(1)$ for large N.
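As a numerical cross-check of the two leading coefficients, one can use the fact, standard in the binary search tree literature but not derived in this section, that both are roots of $\alpha \ln(2e/\alpha) = 1$: the root above 2 gives the height coefficient and the root below 2 the balanced height coefficient. The sketch below solves this equation by plain bisection; the function names and the bracketing intervals are our own illustrative choices.

```python
import math


def front_equation(alpha: float) -> float:
    """Left-hand side minus right-hand side of alpha * ln(2e / alpha) = 1."""
    return alpha * math.log(2.0 * math.e / alpha) - 1.0


def bisect(f, lo: float, hi: float, tol: float = 1e-12) -> float:
    """Bisection root finder; assumes f changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid  # sign change lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # sign change lies in [mid, hi]
    return 0.5 * (lo + hi)


# Assumed brackets: the equation has one root on each side of alpha = 2.
height_coeff = bisect(front_equation, 2.0, 10.0)     # coefficient of log N in <H_N>
balanced_coeff = bisect(front_equation, 0.01, 2.0)   # coefficient of log N in <h_N>
```

Neither root involves D, in line with the statement that the leading behavior is independent of the dimension.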
FIG. 8. The distribution of the number of splittings of a cuboid with sidelength N = x = 1000 for D = 8 (filled circles) and
for D = 10 (filled squares). The distribution is Gaussian for D = 8, but has a non-Gaussian skewness for D = 10. Note that
the theoretically predicted critical dimension is $D_c = 8.69363\ldots$. The histogram was formed by numerically splitting $5 \times 10^5$
samples in each case.
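A numerical splitting experiment of this kind can be sketched as follows. This is a minimal illustration under our own assumptions, not the code used to produce Fig. 8: a cuboid is split whenever its volume exceeds the threshold 1, the cut position along each axis is uniform, and only volumes are tracked because the stopping rule depends on the volume alone. The names `count_splittings` and `sample_mean` are hypothetical.

```python
import random
from itertools import product


def count_splittings(volume: float, dim: int, rng: random.Random,
                     threshold: float = 1.0) -> int:
    """Count splitting events when a cuboid of the given volume is fragmented
    until every piece has volume <= threshold."""
    splits = 0
    stack = [volume]  # explicit stack instead of recursion
    while stack:
        v = stack.pop()
        if v <= threshold:
            continue  # 'dead' piece, never split again
        splits += 1
        cuts = [rng.random() for _ in range(dim)]  # one uniform cut per axis
        # Each of the 2**dim children takes either u or 1 - u along every axis.
        for signs in product((0, 1), repeat=dim):
            frac = 1.0
            for u, s in zip(cuts, signs):
                frac *= u if s else 1.0 - u
            stack.append(v * frac)
    return splits


def sample_mean(volume: float, dim: int, samples: int, seed: int = 0) -> float:
    """Average number of splittings over independent samples."""
    rng = random.Random(seed)
    return sum(count_splittings(volume, dim, rng) for _ in range(samples)) / samples
```

For small D and moderate V the sample mean can be compared with the large-N average 2V/D, and a histogram of `count_splittings` over many samples gives the distribution whose shape changes across $D_c$. Sampling at D = 8 or D = 10 requires volumes large enough to reach the asymptotic regime, which this unoptimized sketch does not attempt.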

For the number of nodes $n_N$, we again find a phase transition [15] driven by the same mechanism mentioned earlier.
We find that while the average number of non-empty nodes $\mu(N) = \langle n_N \rangle \approx 2V/D$ for large N where $V = N^D$,
the variance $\nu(N) = \langle (n_N - \mu(N))^2 \rangle$ undergoes a phase transition at a critical value of
$D_c = \pi / \sin^{-1}(1/\sqrt{8}) = 8.69363\ldots$,



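$$\nu(N) \sim \begin{cases} V & \text{for } D < D_c \\ V^{2\theta} & \text{for } D > D_c, \end{cases} \qquad (25)$$

where $V = N^D$ and we computed the critical exponent $\theta(D) > 1/2$ exactly [15],

$$\theta(D) = 2\cos\left(\frac{2\pi}{D}\right) - 1, \qquad (26)$$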



which increases monotonically with D for $D > D_c$. Furthermore, we computed numerically the full distribution of
$n_N$ for different values of D and found that while the distribution is Gaussian for $D < D_c$ (a fact that can also be
proved analytically), it becomes non-Gaussian for $D > D_c$ (see Fig. 8). As before, once we write down the moment
equations, D can be treated as a continuous parameter, though in actual vector data D represents the dimension of a
vector element and is therefore always an integer.
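The relation between the critical dimension and the exponent is easy to check numerically. The sketch below assumes that Eq. (26) reads $\theta(D) = 2\cos(2\pi/D) - 1$, which is the form consistent with $\theta(D_c) = 1/2$ at the quoted value of $D_c$; the names `theta`, `D_C` and `variance_exponent` are illustrative.

```python
import math


def theta(dim: float) -> float:
    """Assumed form of Eq. (26): theta(D) = 2 cos(2 pi / D) - 1."""
    return 2.0 * math.cos(2.0 * math.pi / dim) - 1.0


# Critical dimension quoted in the text: D_c = pi / arcsin(1 / sqrt(8)).
D_C = math.pi / math.asin(1.0 / math.sqrt(8.0))


def variance_exponent(dim: float) -> float:
    """Exponent of V in the variance, Eq. (25): 1 below D_c, 2 theta(D) above."""
    return 1.0 if dim < D_C else 2.0 * theta(dim)
```

With these definitions `theta(D_C)` equals 1/2 to machine precision, so the two branches of Eq. (25) match at $D_c$. For the two dimensions shown in Fig. 8, `variance_exponent(8)` returns 1, while `variance_exponent(10)` is about 1.236.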



V. CONCLUSION 



In this paper, we have demonstrated how a variety of techniques developed in statistical physics can be successfully
used to understand the statistical properties of various search trees, in particular for the random m-ary search tree
problem. Search trees are the basic objects in data storage and retrieval. Hence we expect that our results will
have important consequences in the 'sorting and searching' area of computer science. Our approach, though perhaps not
rigorous in the strict mathematical sense, has the advantage that it provides a physically transparent derivation of
asymptotic results and can be readily generalized to study different types of search trees. For example, the traveling
front method has subsequently been used to study the so-called 'digital search trees' that are used in the Lempel-Ziv
data compression algorithm [16]. Besides, our approach has the appeal that it makes links between seemingly different
problems and provides us with new results such as those for the vector data. We hope that the techniques discussed
in this paper will be useful in the future for studying other problems in computer science.